Candidate number: 1047951

Table of Contents

1. Data Cleaning & Analysis

1.1. Missing Values - Replacement

1.2. Missing Values - Visualization

1.2.1. Plotting heatmap of NULL values

1.2.2. Plotting barchart

1.2.3. Plotting boxplot of respondents

1.2.4. Plotting boxplot of waves

1.2.5. Combining the plots

1.3. Dropping missing values and no variation

1.4. Transforming non-numerical columns

1.5. Getting different functional forms

2. Lasso - Continuous variables

2.1. Splitting the data and tuning alpha

2.2. Visualizing test errors

2.3. Building a function to get the data for predictions

2.4. Lasso prediction on leaderboard data

2.5. Lasso - quantile transform

2.5.1. Plotting outcome variables

2.5.2. Visualizing GPA transformation

2.5.3. Calculating Lasso Quantile R^2

2.5.4. Testing Quantile model on leaderboard data

3. LightGBM

3.2. Fitting categorical variables with LightGBM

3.3. Predicting continuous variables

3.4. Evaluating LightGBM performance

3.5.1. Importing data

3.5.2. Plotting MSE

4. Fitting Lasso residuals with LightGBM

4.1. Fitting the model

4.2. Plotting the residuals

4.3. Getting the data for residual fits

4.4. Fitting residuals on the leaderboard data

4.5. Calculating t-test statistic for whether there is any relationship between the predicted and actual residuals

4.6. Plotting residual predictions for each group

5. Running LightGBM, RF, and RF + MI

5.1. Setup

5.2. Plotting the Brier loss

5.3. Evaluating the model on the leaderboard

5.4. Random Forest prediction (legacy)

5.5. Plotting the historical RF data (legacy)

5.7. Mutual Information Regression

5.8. Visualizing Random Forest without Mutual Information

5.9. Visualizing Random Forest with Mutual Information

6. Classification - Ensemble model (by vote)

7. Shapley values

8. Do high leverage points impact bad predictions?

8.1. Fitting the model

8.2. Visualizing GPA, predicted vs real

8.3. Calculating z-scores

8.4. Plotting

9. Other

9.1. Calculating the test statistic for whether each variable comes from a given distribution

1. Data Cleaning & Analysis

1.1. Missing Values - Replacement

1.2. Missing Values - Visualization

1.2.1. Plotting heatmap of NULL values

1.2.2. Plotting barchart

1.2.3. Plotting boxplot of respondents

1.2.4. Plotting boxplot of waves

1.2.5. Combining the plots

1.3. Dropping missing values and no variation

1.4. Transforming non-numerical columns

Look at the data types
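A quick way to see which columns still need transforming is to inspect the dtypes; a minimal pandas sketch (the column names here are illustrative, not from the actual survey data):

```python
import pandas as pd

# Toy frame standing in for the survey data (illustrative column names)
df = pd.DataFrame({"gpa": [3.1, 2.8], "wave": ["w1", "w2"], "id": [1, 2]})

print(df.dtypes)  # shows float64 / object / int64 per column

# Columns that are not numeric yet and would need encoding
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```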

1.5. Getting different functional forms

2. Lasso - Continuous variables

2.1. Splitting the data and tuning alpha

2.2. Visualizing test errors

2.3. Building a function to get the data for predictions

2.4. Lasso prediction on leaderboard data

2.5. Lasso - quantile transform

Using data with no additional functional forms
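A minimal sketch of the quantile-transform idea using scikit-learn's `QuantileTransformer` on synthetic data (in the notebook the outcome would be GPA; the data, `alpha`, and `n_quantiles` below are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.exp(X[:, 0] + rng.normal(scale=0.1, size=200))  # skewed outcome

# Map the outcome to a normal distribution before fitting the Lasso
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100, random_state=0)
y_t = qt.fit_transform(y.reshape(-1, 1)).ravel()

lasso = Lasso(alpha=0.01).fit(X, y_t)

# Invert the transform to report predictions on the original outcome scale
y_pred = qt.inverse_transform(lasso.predict(X).reshape(-1, 1)).ravel()
```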

2.5.1. Plotting outcome variables

2.5.2. Visualizing GPA transformation

2.5.3. Calculating Lasso Quantile R^2

2.5.4. Testing Quantile model on leaderboard data

3. LightGBM

3.2. Fitting categorical variables with LightGBM

The code below was used for the actual evaluation of the algorithm, so that the CV data would always be imputed with the mean of the training folds.
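The fold-safe behaviour can be sketched with a scikit-learn `Pipeline` (`Ridge` stands in here for LightGBM, and the data is synthetic): because the imputer sits inside the pipeline, it is refit on each training fold, so a held-out fold is never imputed with its own mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # inject missingness
y = np.nansum(X, axis=1)

# The pipeline imputes each CV split with the mean of its own training fold only
pipe = make_pipeline(SimpleImputer(strategy="mean"), Ridge())
scores = cross_val_score(pipe, X, y, cv=5)
```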

3.3. Predicting continuous variables

3.4. Evaluating LightGBM Performance

3.5.1. Importing data

3.5.2. Plotting MSE

4. Fitting Lasso residuals with LightGBM

4.1. Fitting the model

4.2. Plotting residuals

4.3. Getting the data for residual fits

4.4. Fitting residuals on the leaderboard data

4.5. Calculating t-test statistic for whether there is any relationship between the predicted and actual residuals
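One way to formalise this, sketched on simulated residuals with `scipy.stats.linregress`: the t-statistic and p-value of the slope test whether the predicted residuals carry any linear signal about the actual ones (the data below is synthetic, not the study's).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
predicted = rng.normal(size=100)
actual = 0.3 * predicted + rng.normal(scale=1.0, size=100)  # weak true signal

# Slope t-test: H0 is "no linear relationship between predicted and actual"
res = stats.linregress(predicted, actual)
t_stat = res.slope / res.stderr
print(f"t = {t_stat:.2f}, p = {res.pvalue:.4f}")
```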

4.6. Plotting residual predictions for each group

5. Running LightGBM, RF, and RF + MI

5.1. Setup

5.2. Plotting the Brier loss

5.3. Evaluating the model on the leaderboard

5.4. Random Forest prediction (legacy)

This is not used in the final paper because the trees were too deep and could not find meaningful relationships.

5.5. Plotting the historical RF data (legacy)

The models here were trained with poor hyperparameters (very deep forests that performed badly), which explains the weak results. They are not included in the paper.

5.7. Mutual Information Regression

5.8. Visualizing Random Forest without Mutual Information

5.9. Visualizing Random Forest with Mutual Information

6. Classification - Ensemble model (by vote)

Model 1: LightGBM
Model 2: RF
Model 3: RF + MI

Get the best params of all models

Get train and test sets
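The voting ensemble above can be sketched with scikit-learn's `VotingClassifier` on synthetic data; `GradientBoostingClassifier` stands in for LightGBM here, and the RF + MI member is approximated as a pipeline that selects features by mutual information before the forest (the hyperparameters are illustrative, not the tuned "best params").

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier(random_state=0)),   # LightGBM stand-in
        ("rf", RandomForestClassifier(random_state=0)),
        ("rf_mi", make_pipeline(SelectKBest(mutual_info_classif, k=5),
                                RandomForestClassifier(random_state=0))),
    ],
    voting="hard",  # each model gets one vote; majority wins
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
```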

7. Shapley values

8. Do high leverage points impact bad predictions?

8.1. Fitting the model

  1. Fit Lasso model on scaled data
  2. Scale test data
  3. Get predictions
  4. Get residuals
  5. Plot predictions vs real GPA
  6. Highlight the points that are close to each other

  1. Get Lasso model
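The steps above can be sketched as follows (synthetic data; `alpha` is illustrative); the final plotting step would use matplotlib on `y_pred` vs `y_te`:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1. Fit the Lasso on scaled training data
scaler = StandardScaler().fit(X_tr)
lasso = Lasso(alpha=0.05).fit(scaler.transform(X_tr), y_tr)

# 2.-4. Scale the test data, predict, and take residuals
y_pred = lasso.predict(scaler.transform(X_te))
residuals = y_te - y_pred

# 5.-6. A scatter of y_pred against y_te would then highlight
# the points where prediction and truth are close
```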

8.2. Visualizing GPA, predicted vs real

8.3. Calculating z-scores

  1. Get the most important features of Lasso
  2. Get the feature value of each observation
  3. Calculate the standard deviation and mean of each feature (train set)
  4. Calculate the z-score for all observations
  5. Rank them

  1. Get the indices that have high and low residuals
  2. Calculate the mean z-scores for them by feature
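The z-score steps can be sketched in pandas on synthetic data (the column names stand in for the most important Lasso features, and the residuals are simulated); note that the mean and standard deviation come from the train set only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
cols = ["f1", "f2", "f3"]  # stand-ins for the most important Lasso features
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=cols)
test = pd.DataFrame(rng.normal(size=(50, 3)), columns=cols)
residuals = pd.Series(rng.normal(size=50))

# 3. Mean and standard deviation of each feature, computed on the train set only
mu, sigma = train.mean(), train.std()

# 4.-5. z-score every test observation against the train distribution and rank
z = (test - mu) / sigma
leverage_rank = z.abs().max(axis=1).rank(ascending=False)

# Then: mean z-score by feature for high- vs low-residual observations
high = residuals.abs() > residuals.abs().median()
mean_z_high = z[high.values].mean()
mean_z_low = z[~high.values].mean()
```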

8.4. Plotting

9. Other

9.1. Calculating the test statistic for whether each variable comes from a given distribution
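A sketch of one such test, the Kolmogorov-Smirnov test from scipy, on simulated data (the notebook may use a different statistic or reference distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

# KS statistic: largest gap between the empirical CDF and the standard-normal
# CDF; a small statistic / large p-value means the sample is consistent
# with that distribution
stat, p_value = stats.kstest(sample, "norm")
print(f"KS = {stat:.3f}, p = {p_value:.3f}")
```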